IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

Applicant: Stephane Lubiarz et al.  §  Group Art Unit:
Serial No.: PCT/FR00/02220  §
Filed: August 2, 2000  §
For: METHOD AND DEVICE FOR DETECTING VOICE ACTIVITY  §  Atty. Dkt. No.: MATR-0018-US

Box PCT 

Commissioner for Patents 
Washington, DC 20231

PRELIMINARY AMENDMENT 

Sir: 

Prior to Examination, please amend the above-identified application as follows:

In the Specification:

Page 1, at line 2, please insert the following paragraph: 
--BACKGROUND OF THE INVENTION-- 

Page 2, at line 3, please insert the following paragraph: 
--SUMMARY OF THE INVENTION--



Page 2, delete lines 36-37. 



Page 3, delete lines 1-2. 



Page 3, at line 3, please insert the following paragraph: 
--BRIEF DESCRIPTION OF THE DRAWINGS--

Page 3, at line 25, insert the following paragraph: 
--DETAILED DESCRIPTION--



Page 4, line 32, please replace the formula with the following: 

S_{n,i} = \frac{1}{f(i)-f(i-1)} \sum_{f \in [f(i-1),\, f(i)[} S_{n,f}
Page 5, line 13, please replace the formula with the following (formula has not
changed but replaced for clarity): 



Page 6, line 9, please replace the formula with the following (formula has not changed 
but replaced for clarity): 

Hp„„ = S -'- a --- 6 -- 



Page 7, line 3, please replace the formula with the following (formula has not changed 
but replaced for clarity): 

E_{1,n,j} = \sum_{i=\mathrm{imin}(j)}^{\mathrm{imax}(j)} [f(i)-f(i-1)] \cdot Ep_{1,n,i}^{2}

Page 7, line 4, please replace the formula with the following (formula has not changed 
but replaced for clarity): 

E_{2,n,j} = \sum_{i=\mathrm{imin}(j)}^{\mathrm{imax}(j)} [f(i)-f(i-1)] \cdot Ep_{2,n,i}^{2}



Page 7, line 25, please replace the formula with the following (formula has not 
changed but replaced for clarity): 

\bar{E}_{1,n,j} = \lambda \cdot \bar{E}_{1,n-1,j} + (1-\lambda) \cdot E_{1,n,j}

Page 7, line 26, please replace the formula with the following (formula has not 
changed but replaced for clarity): 

\bar{E}_{2,n,j} = \lambda \cdot \bar{E}_{2,n-1,j} + (1-\lambda) \cdot E_{2,n,j}

Page 11, line 11, please replace the formula with the following (formula has not 
changed but replaced for clarity): 

S n ; =max (s n j -a .B n _j ; ; (3 .B^ ; ) 
Page 11, line 27, please replace the formula with the following (formula has not
changed but replaced for clarity): 



In the Claims:

Cancel claims 14 and 15, without prejudice. 

Amend the following claims: 

1 1. (Amended) Method for detecting voice activity in a digital speech signal in 

2 at least one frequency band, wherein the voice activity is detected on the basis of an analysis 

3 comprising the step of comparing two different versions of the speech signal, wherein at least 

4 one of said versions is a denoised version obtained by taking account of estimates of noise 

5 included in the signal. 

1 2. (Amended) Method according to claim 1, wherein said comparison is 

2 performed on respective energies, evaluated in said frequency band, of the two different 

3 versions of the speech signal, or to a monotonic function of said energies. 

1 3. (Amended) Method according to claim 1, wherein said analysis further 

2 comprises a time smoothing of the energy of one of said versions of the speech signal, and a 

3 comparison between the energy of said version and the smooth energy. 

1 4. (Amended) Method according to claim 3, wherein the comparison between 

2 the energy of said version and the smooth energy controls transitions of a voice activity 

3 detection automaton from a speech state to a silence state, and wherein the comparison of the 

4 two different versions of the speech signal controls transitions of the detection automaton 

5 from the silence state to the speech state. 

1 5. (Amended) Method according to claim 1, wherein the two different 

2 versions of the speech signal are two versions denoised by non-linear spectral subtraction, 

3 wherein a first of the two versions is denoised in such a way as not to be less, in the spectral 

4 domain, than a first fraction of a long-term estimate representative of a noise component 

5 included in the speech signal, and the second of the two versions is denoised in such a way as 
not to be less, in the spectral domain, than a second fraction of said long-term estimate,
smaller than said first fraction. 



1 6. (Amended) Method according to claim 5, wherein said analysis further 

2 comprises a time smoothing of the energy of each of the two versions of the speech signal, by 

3 means of a smoothing window determined by comparing the energy of the second of the two 

4 versions with the smoothed energy of the second of the two versions. 

1 7. (Amended) Method according to claim 6, wherein the smoothing window is 

2 an exponential window defined by a forgetting factor. 

1 8. (Amended) Method according to claim 7, comprising the step of allocating 

2 a substantially zero value to the forgetting factor when the energy of the second of the two 

3 versions is less than a value of the order of the smoothed energy of the second of the two 

4 versions. 

1 9. (Amended) Method according to claim 8, comprising the step of allocating 

2 a first value substantially equal to 1 to the forgetting factor when the energy of the second of 

3 the two versions is greater than said value of the order of the smooth energy multiplied by a 

4 coefficient bigger than 1, and allocating a second value lying between 0 and said first value to 

5 the forgetting factor when the energy of the second of the two versions is greater than said 

6 value of the order of the smoothed energy and less than said value of the order of the 

7 smoothed energy multiplied by said coefficient. 
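Claims 6 to 9 can be read as an update rule for the exponential smoothing window. The following Python sketch is one possible reading, not the applicant's implementation: the function names, the coefficient value `delta`, and the three forgetting-factor values are hypothetical, and "a value of the order of the smoothed energy" is taken to be the previous smoothed energy itself.

```python
def choose_forgetting_factor(energy2, smoothed2, delta=4.0,
                             lam_r=0.0, lam_p=0.9, lam_q=0.99):
    """Pick the forgetting factor from the second version's energy.

    delta, lam_r, lam_p and lam_q are illustrative values only.
    """
    if energy2 < smoothed2:
        return lam_r          # substantially zero: follow the energy at once
    if energy2 > delta * smoothed2:
        return lam_q          # substantially 1: smoothed energy barely moves
    return lam_p              # intermediate value between 0 and lam_q


def smooth_step(energy1, energy2, smoothed1, smoothed2, **params):
    """Apply one exponential-window update to both versions' energies."""
    lam = choose_forgetting_factor(energy2, smoothed2, **params)
    new1 = lam * smoothed1 + (1.0 - lam) * energy1
    new2 = lam * smoothed2 + (1.0 - lam) * energy2
    return new1, new2, lam
```

With these assumed values, an energy drop is tracked immediately, while a sharp rise leaves the smoothed energy almost unchanged.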

1 10. (Amended) Method according to claim 1, wherein the first and second 

2 fractions correspond substantially to attenuations of 10 dB and 60 dB, respectively.

1 11. (Amended) Method according to claim 1, wherein the comparison of the

2 two different versions of the speech signal is performed on respective differences between the 

3 energies of said two versions in said frequency band and a lower bound of the energy of the 

4 denoised version of the speech signal in said frequency band. 

1 12. (Amended) Method according to claim 11, wherein one of the two different 

2 versions of the speech signal is a non-denoised version of the speech signal. 
1 13. (Amended) Device for detecting voice activity in a speech signal,

2 comprising signal processing means for analyzing the speech signal in at least one frequency 

3 band, wherein the processing means comprise means for comparing two different versions of 

4 the speech signal, wherein at least one of said versions is a denoised version, obtained by 

5 taking account of estimates of noise included in the signal. 

Add the following claims: 

1 16. (New) Device according to claim 13, wherein the processing means comprise 

2 means for evaluating, in said frequency band, energies of said two different versions of the 

3 speech signal, whereby inputs of the comparison means comprise said energies or a 

4 monotonic function of said energies. 

1 17. (New) Device according to claim 13, wherein the processing means further 

2 comprises means for performing a time smoothing of the energy of one of said versions of the 

3 speech signal, and means for comparing the energy of said version and the smoothed energy. 

1 18. (New) Device according to claim 17, wherein the processing means comprise 

2 a voice activity detection automaton having a plurality of states including a speech state and a 

3 silence state, means for controlling transitions of the voice activity detection automaton from 

4 the speech state to the silence state based on a comparison between the energy of said version 

5 and the smoothed energy, and means for controlling transitions of the voice activity detection 

6 automaton from the silence state to the speech state based on a comparison of the two 

7 different versions of the speech signal. 

1 19. (New) Device according to claim 13, further comprising first non-linear 

2 spectral subtraction means to provide a first of the two versions of the speech signal as a 

3 denoised version which is not less, in the spectral domain, than a first fraction of a long-term 

4 estimate representative of a noise component included in the speech signal, and second non- 

5 linear spectral subtraction means to provide a second of the two versions of the speech signal 

6 as a denoised version which is not less, in the spectral domain, than a second fraction of said 

7 long-term estimate, said second fraction being smaller than said first fraction. 
1 20. (New) Device according to claim 19, wherein the processing means further

2 comprises means for performing a time smoothing of the energy of each of the two versions 

3 of the speech signal, by means of a smoothing window determined by comparing an energy 

4 of the second of the two versions with the smoothing energy of the second of the two 

5 versions. 

1 21. (New) Device according to claim 20, wherein the smoothing window is an 

2 exponential window defined by a forgetting factor. 

1 22. (New) Device according to claim 21, wherein the processing means further

2 comprises means for allocating a substantially zero value to the forgetting factor when the 

3 energy of the second of the two versions is less than a value of the order of the smoothed 

4 energy of the second of the two versions. 

1 23. (New) Device according to claim 22, wherein the processing means further 

2 comprises means for allocating a first value substantially equal to 1 to the forgetting factor 

3 when the energy of the second of the two versions is greater than said value of the order of 

4 the smoothed energy multiplied by a coefficient bigger than 1, and for allocating a second

5 value lying between 0 and said first value to the forgetting factor when the energy of the 

6 second of the two versions is greater than said value of the order of the smoothed energy and 

7 less than said value of the order of the smooth energy multiplied by said coefficient. 

1 24. (New) Device according to claim 13, wherein the first and second fractions 

2 correspond substantially to attenuations of 10 dB and 60 dB, respectively. 

1 25. (New) Device according to claim 13, wherein the comparison of the two 

2 different versions of the speech signal is performed on respective differences between the 

3 energies of said two versions in said frequency band and a lower bound of the energy of the 

4 denoised version of the speech signal in said frequency band. 

1 26. (New) Device according to claim 25, wherein one of the two different 

2 versions of the speech signal is a non-denoised version of the speech signal. 
1 27. (New) A computer program product, loadable into a memory associated with

2 a processor, and comprising portions of code for execution by the processor to detect voice 

3 activity in an input digital speech signal in at least one frequency band, whereby the voice 

4 activity is detected on the basis of an analysis comprising the step of comparing two different 

5 versions of the speech signal, wherein at least one of said versions is a denoised version 

6 obtained by taking account of estimates of noise included in the signal. 



1 28. (New) A computer program product according to claim 27, wherein said 

2 comparison is performed on respective energies, evaluated in said frequency band, of the two 

3 different versions of the speech signal, or to a monotonic function of said energies. 

1 29. (New) A computer program product according to claim 1, wherein said

2 analysis further comprises a time smoothing of the energy of one of said versions of the 

3 speech signal, and a comparison between the energy of said version of the smoothed energy. 

1 30. (New) A computer program product according to claim 29, wherein the 

2 comparison between the energy of said version and the smoothed energy control transitions 

3 of a voice activity detection automaton from a speech state to a silence state, and wherein the 

4 comparison of the two different versions of the speech signal controls transitions of the 

5 detection automaton from the silence state to the speech state. 

1 31. (New) A computer program product according to claim 27, wherein the two

2 different versions of the speech signal are two versions denoised by non-linear spectral 

3 subtraction, wherein a first of the two versions is denoised in such a way as not to be less, in 

4 the spectral domain, than a first fraction of a long-term estimate representative of a noise 

5 component included in the speech signal, and the second of the two versions is denoised in 

6 such a way as not to be less, in the spectral domain, than a second fraction of said long-term 

7 estimate, smaller than said first fraction. 



1 32. (New) A computer program product according to claim 31, wherein said

2 analysis further comprises a time smoothing of the energy of each of the two versions of the 

3 speech signal, by means of a smoothing window determined by comparing the energy of the 

4 second of the two versions with the smoothed energy of the second of the two versions. 
33. (New) A computer program product according to claim 32, wherein the
smoothing window is an exponential window defined by a forgetting factor. 



1 34. (New) A computer program product according to claim 33, wherein said 

2 analysis further comprises the step of allocating a substantially zero value to the forgetting 

3 factor when the energy of the second of the two versions is less than a value of the order of 

4 the smoothed energy of the second of the two versions. 

1 35. (New) A computer program product according to claim 34, wherein said 

2 analysis further comprises the steps of allocating a first value substantially equal to 1 to the 

3 forgetting factor when the energy of the second of the two versions is greater than said value 

4 of the order of the smoothed energy multiplied by a coefficient bigger than 1, and allocating a 

5 second value lying between 0 and said first value to the forgetting factor when the energy of 

6 the second of the two versions is greater than said value of the order of the smoothed energy 

7 and less than said value of the order of the smoothed energy multiplied by said coefficient. 

1 36. (New) A computer program product according to claim 27, wherein the first 

2 and second fractions correspond substantially to attenuations of 10 dB and 60 dB, 
3 respectively.

1 37. (New) A computer program product according to claim 27, wherein the 

2 comparison of the two different versions of the speech signal is performed on respective 

3 differences between the energies of said two versions in said frequency band and a lower 

4 bound of the energy of the denoised version of the speech signal in said frequency band. 



1 38. (New) A computer program product according to claim 37, wherein one of the 

2 two different versions of the speech signal is a non-denoised version of the speech signal. 



Remarks: 

Allowance of all claims is respectfully requested. The Commissioner is authorized to 
charge any additional fees under 37 C.F.R. § 1.16 and § 1.17, or credit any overpayment to 
Deposit Account No. 20-1504 (MATR-0018-US). 



Respectfully submitted, 



Date:




Dan C. Hu, Registration No. 40,025 



TROP, PRUNER & HU, P.C. 
8554 Katy Freeway, Suite 100 
Houston, Texas 77024-1805 
(713) 468-8880 [Phone] 
(713) 468-8883 [Fax] 
VERSIONS WITH MARKINGS TO SHOW CHANGES



IN THE CLAIMS:

Claims 14-15 have been cancelled. New claims 16-38 have been added. 
Amendments of the claims are indicated below: 

1 1. (Amended) Method for detecting voice activity in a digital speech signal

2 [(s)] in at least one frequency band, [characterized in that] wherein the voice activity is 

3 detected on the basis of an analysis comprising [a comparison, in the said frequency band,] 

4 the step of comparing two different versions of the speech signal, [one] wherein at least one 

5 of [which] said versions is a denoised version obtained by taking account of estimates of [the] 

6 noise included in the signal. 



1 2. (Amended) Method according to claim 1, [in which the] wherein said 

2 comparison [pertains to] is performed on respective energies [(E_{1,n,j}, E_{2,n,j})], evaluated in [the]

3 said frequency band, of the two different versions of the speech signal, or to a monotonic 

4 function of [the] said energies. 

1 3. (Amended) Method according to claim 1, wherein [or 2, in which the] said

2 analysis [furthermore] further comprises a [temporal] time smoothing of the energy [(E_{1,n,j})]

3 of one of [the] said versions of the speech signal, and a comparison between the energy of

4 [the] said version and the smooth energy [(\bar{E}_{1,n,j})].

1 4. (Amended) Method according to claim 3, [in which] wherein the 



2 comparison between the energy of [the] said version [(E_{1,n,j})] and the smooth energy [(\bar{E}_{1,n,j})]

3 controls [the] transitions of a voice activity detection automaton from a speech state to a 

4 silence state, [whilst] and wherein the comparison of the two different versions of the speech 

5 signal controls [the] transitions of the detection automaton from the silence state to the speech 

6 state. 



1 5. (Amended) Method according to [any one of claims 1 to 4, in which] claim 

2 1, wherein the two different versions of the speech signal are two versions denoised by non-

3 linear spectral subtraction, wherein a first of the two versions [(Ep_{1,n,i}) being] is denoised in

4 such a way as not to be less, in the spectral domain, than a first fraction [(\beta 1_{i})] of a long-term

5 estimate [(\hat{B}_{n,i})] representative of a noise component included in the speech signal, and the

6 second of the two versions [(Ep_{2,n,i}) being] is denoised in such a way as not to be less, in the

7 spectral domain, than a second fraction [(\beta 2_{i})] of [the] said long-term estimate, smaller than

8 [the] said first fraction.

1 6. (Amended) Method according to claim 5, [in which a temporal] wherein 

2 said analysis further comprises a time smoothing of the energy of each of the two versions of 

3 the speech signal [is performed], by means of a [determined] smoothing window determined 

4 by comparing the energy [(E_{2,n,j})] of the second of the two versions with the smoothed energy

5 [(\bar{E}_{2,n,j})] of the second of the two versions.

1 7. (Amended) Method according to claim 6, [in which] wherein the smoothing 

2 window is an exponential window defined by a [forget] forgetting factor [(\lambda)].

1 8. (Amended) Method according to claim 7, [in which the forget factor (\lambda)

2 has] comprising the step of allocating a substantially zero value [(\lambda_{r})] to the forgetting factor

3 when the energy [(E_{2,n,j})] of the second of the two versions is less than a value of the order of

4 the smoothed energy [(\bar{E}_{2,n,j})] of the second of the two versions.

1 9. (Amended) Method according to claim 8, [in which the forget factor (\lambda)

2 has] comprising the step of allocating a first value [(\lambda_{q})] substantially equal to 1 to the

3 forgetting factor when the energy [(E_{2,n,j})] of the second of the two versions is greater than

4 [the] said value of the order of the smooth energy multiplied by a coefficient [(\Delta)] bigger than

5 1, and allocating a second value [(\lambda_{p})] lying between 0 and [the] said first value to the

6 forgetting factor when the energy of the second of the two versions is greater than [the] said 

7 value of the order of the smoothed energy and less than [the] said value of the order of the 

8 smoothed energy multiplied by [the] said coefficient. 

1 10. (Amended) Method according to claim 1, wherein [any one of claims 5 to 

2 9, in which] the first and second fractions [(\beta 1_{i}, \beta 2_{i})] correspond substantially to attenuations

3 of 10 dB and 60 dB, respectively. 
11. (Amended) Method according to claim 1, wherein [any one of claims 1 to
10, in which] the comparison of the two different versions of the speech signal [pertains to] is



3 performed on respective differences between the energies [(E_{1,n,j}, E_{2,n,j}) of these] of said two

4 versions in [the] said frequency band and a lower bound [(E_{2min,j})] of the energy [(E_{2,n,j})] of

5 the denoised version of the speech signal in [the] said frequency band. 

1 12. (Amended) Method according to claim 11, [in which] wherein one of the

2 two different versions of the speech signal is a non-denoised version of the speech signal. 

1 13. (Amended) Device for detecting voice activity in a speech signal, 

2 comprising signal processing means [(15) designed to implement a method according to any 

3 one of claims 1 to 12] for analyzing the speech signal in at least one frequency band, wherein 

4 the processing means comprise means for comparing two different versions of the speech 

5 signal wherein at least one of said versions is a denoised version, obtained by taking account 

6 of estimates of noise included in the signal.

1 16. (New) Device according to claim 13, wherein the processing means comprise 

2 means for evaluating, in said frequency band, energies of said two different versions of the 

3 speech signal, whereby inputs of the comparison means comprise said energies or a 

4 monotonic function of said energies. 

1 17. (New) Device according to claim 13, wherein the processing means further 

2 comprises means for performing a time smoothing of the energy of one of said versions of the 

3 speech signal, and means for comparing the energy of said version and the smoothed energy. 

1 18. (New) Device according to claim 17, wherein the processing means comprise 

2 a voice activity detection automaton having a plurality of states including a speech state and a 

3 silence state, means for controlling transitions of the voice activity detection automaton from 

4 the speech state to the silence state based on a comparison between the energy of said version 

5 and the smoothed energy, and means for controlling transitions of the voice activity detection 

6 automaton from the silence state to the speech state based on a comparison of the two 

7 different versions of the speech signal. 

1 19. (New) Device according to claim 13, further comprising first non-linear 

2 spectral subtraction means to provide a first of the two versions of the speech signal as a 

3 denoised version which is not less, in the spectral domain, than a first fraction of a long-term 



4 estimate representative of a noise component included in the speech signal, and second non- 

5 linear spectral subtraction means to provide a second of the two versions of the speech signal 

6 as a denoised version which is not less, in the spectral domain, than a second fraction of said 

7 long-term estimate, said second fraction being smaller than said first fraction. 

1 20. (New) Device according to claim 19, wherein the processing means further 



2 comprises means for performing a time smoothing of the energy of each of the two versions 

3 of the speech signal, by means of a smoothing window determined by comparing an energy 

4 of the second of the two versions with the smoothing energy of the second of the two 

5 versions. 

1 21. (New) Device according to claim 20, wherein the smoothing window is an

2 exponential window defined by a forgetting factor. 

1 22. (New) Device according to claim 21, wherein the processing means further 

2 comprises means for allocating a substantially zero value to the forgetting factor when the 

3 energy of the second of the two versions is less than a value of the order of the smoothed 

4 energy of the second of the two versions. 

1 23. (New) Device according to claim 22, wherein the processing means further 

2 comprises means for allocating a first value substantially equal to 1 to the forgetting factor 

3 when the energy of the second of the two versions is greater than said value of the order of 

4 the smoothed energy multiplied by a coefficient bigger than 1, and for allocating a second

5 value lying between 0 and said first value to the forgetting factor when the energy of the 

6 second of the two versions is greater than said value of the order of the smoothed energy and 

7 less than said value of the order of the smooth energy multiplied by said coefficient. 



1 24. (New) Device according to claim 13, wherein the first and second fractions 

2 correspond substantially to attenuations of 10 dB and 60 dB, respectively. 

1 25. (New) Device according to claim 13, wherein the comparison of the two 

2 different versions of the speech signal is performed on respective differences between the 

3 energies of said two versions in said frequency band and a lower bound of the energy of the 

4 denoised version of the speech signal in said frequency band. 



1 26. (New) Device according to claim 25, wherein one of the two different 

2 versions of the speech signal is a non-denoised version of the speech signal. 

1 27. (New) A computer program product, loadable into a memory associated with 

2 a processor, and comprising portions of code for execution by the processor to detect voice 

3 activity in an input digital speech signal in at least one frequency band, whereby the voice 

4 activity is detected on the basis of an analysis comprising the step of comparing two different 

5 versions of the speech signal, wherein at least one of said versions is a denoised version 

6 obtained by taking account of estimates of noise included in the signal. 



1 28. (New) A computer program product according to claim 27, wherein said 

2 comparison is performed on respective energies, evaluated in said frequency band, of the two

3 different versions of the speech signal, or to a monotonic function of said energies. 

1 29. (New) A computer program product according to claim 1, wherein said 

2 analysis further comprises a time smoothing of the energy of one of said versions of the 

3 speech signal, and a comparison between the energy of said version of the smoothed energy.

1 30. (New) A computer program product according to claim 29, wherein the 

2 comparison between the energy of said version and the smoothed energy control transitions 

3 of a voice activity detection automaton from a speech state to a silence state, and wherein the 

4 comparison of the two different versions of the speech signal controls transitions of the 

5 detection automaton from the silence state to the speech state. 

1 31. (New) A computer program product according to claim 27, wherein the two 

2 different versions of the speech signal are two versions denoised by non-linear spectral 

3 subtraction, wherein a first of the two versions is denoised in such a way as not to be less, in 

4 the spectral domain, than a first fraction of a long-term estimate representative of a noise 

5 component included in the speech signal, and the second of the two versions is denoised in 

6 such a way as not to be less, in the spectral domain, than a second fraction of said long-term 

7 estimate, smaller than said first fraction. 



1 32. (New) A computer program product according to claim 31, wherein said

2 analysis further comprises a time smoothing of the energy of each of the two versions of the 

3 speech signal, by means of a smoothing window determined by comparing the energy of the 

4 second of the two versions with the smoothed energy of the second of the two versions. 

1 33. (New) A computer program product according to claim 32, wherein the 

2 smoothing window is an exponential window defined by a forgetting factor. 

1 34. (New) A computer program product according to claim 33, wherein said 

2 analysis further comprises the step of allocating a substantially zero value to the forgetting

3 factor when the energy of the second of the two versions is less than a value of the order of 

4 the smoothed energy of the second of the two versions. 

1 35. (New) A computer program product according to claim 34, wherein said 

2 analysis further comprises the steps of allocating a first value substantially equal to 1 to the 

3 forgetting factor when the energy of the second of the two versions is greater than said value 

4 of the order of the smoothed energy multiplied by a coefficient bigger than 1, and allocating a 

5 second value lying between 0 and said first value to the forgetting factor when the energy of 

6 the second of the two versions is greater than said value of the order of the smoothed energy 

7 and less than said value of the order of the smoothed energy multiplied by said coefficient. 

1 36. (New) A computer program product according to claim 27, wherein the first 

2 and second fractions correspond substantially to attenuations of 10 dB and 60 dB,

3 respectively. 

1 37. (New) A computer program product according to claim 27, wherein the 

2 comparison of the two different versions of the speech signal is performed on respective 

3 differences between the energies of said two versions in said frequency band and a lower 

4 bound of the energy of the denoised version of the speech signal in said frequency band. 

1 38. (New) A computer program product according to claim 37, wherein one of the 

2 two different versions of the speech signal is a non-denoised version of the speech signal. 
METHOD AND DEVICE FOR DETECTING VOICE ACTIVITY

The present invention relates to digital
techniques for processing speech signals. It relates
more particularly to techniques that use voice
activity detection to process the signal differently
depending on whether or not it carries voice activity.

The digital techniques in question come under 
varied domains: coding of speech for transmission or 
storage, speech recognition, noise reduction, echo 
cancellation, etc. 

The main difficulty with processes for
detecting voice activity is that of distinguishing
between voice activity and the noise which accompanies
the speech signal.

The document WO 99/14737 describes a method of
detecting voice activity in a digital speech signal 
processed on the basis of successive frames and in 
which an a priori denoising of the speech signal of 
each frame is carried out on the basis of noise 
estimates obtained during the processing of one or more 
previous frames, and the variations in the energy of 
the a priori denoised signal are analyzed so as to 
detect a degree of voice activity of the frame. By 
carrying out the detection of voice activity on the 
basis of an a priori denoised signal, the performance 
of this detection is substantially improved when the 
surrounding noise is relatively strong. 

In the methods customarily used to detect voice
activity, the energy variations of the (direct or
denoised) signal are analyzed with respect to a long-term
average of the energy of this signal, a relative
increase in the instantaneous energy suggesting the
appearance of voice activity.
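The customary approach described above can be sketched as follows in Python. This is a minimal illustration only: the function name, the averaging constant `alpha`, the detection `ratio`, and the choice to update the average only on inactive frames are assumptions, not details taken from the cited methods.

```python
def long_term_average_vad(frame_energies, alpha=0.95, ratio=2.0):
    """Flag frames whose energy rises well above a long-term average.

    alpha (averaging constant) and ratio (detection margin) are
    illustrative values.
    """
    decisions = []
    average = None
    for energy in frame_energies:
        if average is None:
            # First frame only initializes the long-term average.
            average = energy
            decisions.append(False)
            continue
        active = energy > ratio * average
        decisions.append(active)
        if not active:
            # Assumed policy: track the average on inactive frames only.
            average = alpha * average + (1.0 - alpha) * energy
    return decisions
```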

An aim of the present invention is to propose
another type of analysis allowing voice activity
detection which is robust to the noise which may
accompany the speech signal.

WO 01/11605    PCT/FR00/02220

According to the invention, there is proposed a 
method for detecting voice activity in a digital speech 
signal in at least one frequency band, whereby the 
voice activity is detected on the basis of an analysis 
comprising a comparison, in the said frequency band, of 
two different versions of the speech signal, one at 
least of which is a denoised version obtained by taking 
account of estimates of the noise included in the 
signal.

This method can be executed over the entire 
frequency band of the signal, or on a subband basis, as 
a function of the requirements of the application using 
voice activity detection. 

Voice activity can be detected in a binary 
manner for each band, or measured by a continuously 
varying parameter which may result from the comparison 
between the two different versions of the speech 
signal.

The comparison typically pertains to respective 
energies, evaluated in the said frequency band, of the 
two different versions of the speech signal, or to a 
monotonic function of these energies. 

Another aspect of the present invention relates 
to a device for detecting voice activity in a speech 
signal, comprising signal processing means designed to 
implement a method as defined hereinabove. 

The invention further relates to a computer 
program, loadable into a memory associated with a 
processor, and comprising portions of code for 
implementing a method as defined hereinabove upon the 
execution of the said program by the processor, as well 
as to a computer medium, on which such a program is 
recorded. 

Other features and advantages of the present
invention will become apparent in the following
description of non-limiting exemplary embodiments, with
reference to the appended drawings, in which:

Figure 1 is a schematic diagram of a signal 
processing chain using a voice activity detector 
according to the invention; 

Figure 2 is a schematic diagram of an exemplary 
voice activity detector according to the 
invention; 

Figures 3 and 4 are flow charts of signal
processing operations performed in the detector of
Figure 2;

Figure 5 is a graph showing an exemplary profile
of energies calculated in the detector of Figure 2
and illustrating the principle of voice activity
detection;

Figure 6 is a diagram of a detection automaton
implemented in the detector of Figure 2;

Figure 7 is a schematic diagram of another
embodiment of a voice activity detector according
to the invention;

Figure 8 is a flow chart of signal processing
operations performed in the detector of Figure 7;

Figure 9 is a graph of a function used in the
operations of Figure 8.

The device of Figure 1 processes a digital
speech signal s. The signal processing chain
represented produces voice activity decisions $\delta_{n,j}$ which
are usable in a manner known per se by application
units, not represented, affording functions such as
speech coding, speech recognition, noise reduction,
echo cancellation, etc. The decisions $\delta_{n,j}$ can comprise
a frequency resolution (index j), this making it
possible to enhance applications operating in the
frequency domain.

A windowing module 10 puts the signal s into
the form of successive windows or frames of index n,
each consisting of a number N of samples of digital
signal. In a conventional manner, these frames may
exhibit mutual overlaps. In the remainder of the
present description, the frames will be regarded,
without this being in any way limiting, as consisting
of N = 256 samples at a sampling frequency $F_e$ of 8 kHz,
with a Hamming weighting in each window, and overlaps
of 50% between consecutive windows.
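The framing described above can be sketched as follows. This is only an illustrative floating-point sketch, not the fixed-point code of the appendices: the function names `hamming_window` and `make_frames` are hypothetical, and the usual Hamming formula w[k] = 0.54 - 0.46 cos(2πk/(N-1)) is assumed, since the description does not spell it out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hamming weights for a window of n samples (standard formula, assumed here).
std::vector<double> hamming_window(std::size_t n)
{
    assert(n > 1);
    const double pi = std::acos(-1.0);
    std::vector<double> w(n);
    for (std::size_t k = 0; k < n; ++k)
        w[k] = 0.54 - 0.46 * std::cos(2.0 * pi * k / (n - 1));
    return w;
}

// Cut the signal s into weighted frames of n samples with 50% overlap.
// Trailing samples that do not fill a whole frame are ignored.
std::vector<std::vector<double>> make_frames(const std::vector<double>& s,
                                             std::size_t n)
{
    assert(n > 1 && n % 2 == 0);
    const std::size_t hop = n / 2;  // 50% overlap between consecutive windows
    const std::vector<double> w = hamming_window(n);
    std::vector<std::vector<double>> frames;
    for (std::size_t start = 0; start + n <= s.size(); start += hop) {
        std::vector<double> frame(n);
        for (std::size_t k = 0; k < n; ++k)
            frame[k] = w[k] * s[start + k];
        frames.push_back(frame);
    }
    return frames;
}
```

With N = 256 and $F_e$ = 8 kHz, each frame spans 32 ms and a new frame starts every 16 ms.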

The signal frame is transformed into the
frequency domain by a module 11 applying a conventional
fast Fourier transform algorithm (FFT) for calculating
the modulus of the spectrum of the signal. The module
11 then delivers a set of N = 256 frequency components
of the speech signal, which are denoted $S_{n,f}$, where n
designates the current frame number, and f a frequency
of the discrete spectrum. Owing to the properties of
digital signals in the frequency domain, only the first
N/2 = 128 samples are used.

To calculate the estimates of the noise
contained in the signal s, we do not use the frequency
resolution available at the output of the fast Fourier
transform, but a lower resolution, determined by a
number I of frequency subbands covering the $[0, F_e/2]$
band of the signal. Each subband i ($1 \le i \le I$) extends
between a lower frequency f(i-1) and an upper
frequency f(i), with f(0) = 0 and f(I) = $F_e/2$. This
chopping into subbands can be uniform (f(i)-f(i-1) =
$F_e/(2I)$). It may also be non-uniform (for example
according to a Bark scale). A module 12 calculates the
respective averages of the spectral components $S_{n,f}$ of
the speech signal on a subband basis, for example
through a uniform weighting such as:

$$S_{n,i} = \frac{1}{f(i)-f(i-1)} \sum_{f \in [f(i-1),\, f(i)[} S_{n,f}$$

This averaging reduces the fluctuations between
the subbands by averaging the contributions of the
noise in these subbands, and this will reduce the
variance of the noise estimator. Furthermore, this
averaging makes it possible to reduce the complexity of
the system.
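As an illustration of the subband averaging performed by the module 12, here is a minimal sketch for the uniform case only. The name `subband_average` is hypothetical, and the sketch assumes that the number of FFT bins is an exact multiple of the number of subbands (for example 128 bins and I = 16).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average the spectral moduli S[n][f] over uniform subbands:
// each output value is the mean of the bins falling in one subband.
std::vector<double> subband_average(const std::vector<double>& bins,
                                    std::size_t subbands)
{
    assert(subbands > 0 && bins.size() % subbands == 0);
    const std::size_t width = bins.size() / subbands;  // bins per subband
    assert(width > 0);
    std::vector<double> out(subbands, 0.0);
    for (std::size_t i = 0; i < subbands; ++i) {
        double sum = 0.0;
        for (std::size_t k = 0; k < width; ++k)
            sum += bins[i * width + k];
        out[i] = sum / static_cast<double>(width);
    }
    return out;
}
```

A non-uniform (Bark-like) split would need an explicit table of band edges instead of a fixed width.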

The averaged spectral components $S_{n,i}$ are
addressed to a voice activity detection module 15 and
to a noise estimation module 16. $\hat{B}_{n,i}$ denotes the
long-term estimate of the noise component produced by the
module 16 in relation to frame n and to subband i.

These long-term estimates $\hat{B}_{n,i}$ may for example
be obtained in the manner described in WO99/14737. It
is also possible to use simple smoothing by means of an
exponential window defined by a forget factor $\lambda_B$:

$$\hat{B}_{n,i} = \lambda_B \cdot \hat{B}_{n-1,i} + (1-\lambda_B) \cdot S_{n,i}$$

with $\lambda_B$ equal to 1 if the voice activity detector 15
indicates that subband i bears voice activity, and
equal to a value lying between 0 and 1 otherwise.

Of course, it is possible to use other
long-term estimates representative of the noise component
included in the speech signal. These estimates may
represent a long-term average, or else a minimum of the
component $S_{n,i}$ over a sufficiently long sliding window.
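The exponential-window noise update can be sketched as below. This is a floating-point illustration under stated assumptions, not the estimator of WO99/14737: the function name `update_noise_estimate`, the per-subband speech flags and the parameter `lambda_noise` (the value between 0 and 1 used outside speech) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One frame of the update B[n][i] = lambdaB * B[n-1][i] + (1 - lambdaB) * S[n][i].
// lambdaB is forced to 1 (estimate frozen) in subbands flagged as speech.
void update_noise_estimate(std::vector<double>& noise,           // B[n-1][i] in, B[n][i] out
                           const std::vector<double>& spectrum,  // S[n][i]
                           const std::vector<bool>& speech,      // per-subband VAD flag
                           double lambda_noise)                  // assumed value in (0, 1)
{
    assert(noise.size() == spectrum.size() && noise.size() == speech.size());
    assert(lambda_noise > 0.0 && lambda_noise < 1.0);
    for (std::size_t i = 0; i < noise.size(); ++i) {
        const double lambda_b = speech[i] ? 1.0 : lambda_noise;
        noise[i] = lambda_b * noise[i] + (1.0 - lambda_b) * spectrum[i];
    }
}
```

Freezing the estimate during speech keeps the voice energy from leaking into the noise estimate.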

Figures 2 to 6 illustrate a first embodiment of
the voice activity detector 15. A denoising module 18
executes, for each frame n and each subband i, the
operations corresponding to steps 180 to 187 of Figure
3, so as to produce two denoised versions $\hat{E}p_{1,n,i}$, $\hat{E}p_{2,n,i}$
of the speech signal. This denoising is done by
non-linear spectral subtraction. The first version $\hat{E}p_{1,n,i}$
is denoised in such a way as not to be less, in the
spectral domain, than a fraction $\beta1_i$ of the long-term
estimate $\hat{B}_{n-\tau1,i}$. The second version $\hat{E}p_{2,n,i}$ is denoised
in such a way as not to be less, in the spectral
domain, than a fraction $\beta2_i$ of the long-term estimate
$\hat{B}_{n-\tau1,i}$. The quantity $\tau1$ is a delay expressed as a number
of frames, which may be fixed (for example $\tau1$ = 1) or
variable. The more confident one is in the voice
activity detection, the smaller the delay will be. The
fractions $\beta1_i$ and $\beta2_i$ (such that $\beta1_i > \beta2_i$) may be
dependent on or independent of subband i. Preferred
values correspond for $\beta1_i$ to an attenuation of 10 dB,
and for $\beta2_i$ to an attenuation of 60 dB, i.e. $\beta1_i \approx 0.3$
and $\beta2_i \approx 0.001$.
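As a quick check of these figures, and assuming that the attenuations are amplitude ratios in the spectral (modulus) domain, an attenuation of A dB corresponds to a fraction $10^{-A/20}$:

$$10^{-10/20} \approx 0.316, \qquad 10^{-60/20} = 0.001$$

so the 10 dB fraction is rounded to 0.3 in the text.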

In step 180, the module 18 calculates, with the
resolution of the subbands i, the frequency response
$Hp_{n,i}$ of the a priori denoising filter, according to:

$$Hp_{n,i} = \frac{S_{n,i} - \alpha'_{n-\tau1,i} \cdot \hat{B}_{n-\tau1,i}}{S_{n-\tau2,i}}$$

where $\tau2$ is a positive or zero integer delay and $\alpha'_{n,i}$
is a noise overestimation coefficient. This
overestimation coefficient $\alpha'_{n,i}$ may be dependent on or
independent of the frame index n and/or the subband
index i. In a preferred embodiment, it depends both on
n and i, and it is determined as described in document
WO99/14737. A first denoising is performed in step 181:
$\hat{E}p_{n,i} = Hp_{n,i} \cdot S_{n,i}$. In steps 182 to 184, the spectral
components $\hat{E}p_{1,n,i}$ are calculated according to
$\hat{E}p_{1,n,i} = \max(\hat{E}p_{n,i};\ \beta1_i \cdot \hat{B}_{n-\tau1,i})$, and in steps 185 to 187,
the spectral components $\hat{E}p_{2,n,i}$ are calculated according to
$\hat{E}p_{2,n,i} = \max(\hat{E}p_{n,i};\ \beta2_i \cdot \hat{B}_{n-\tau1,i})$.
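A minimal floating-point sketch of the two-floor spectral subtraction follows. It assumes $\tau2$ = 0, so that the filtered component reduces to S minus the overestimated noise, and it leaves the delay $\tau1$ to the caller, who passes the appropriately delayed noise estimate. The names `DenoisedPair` and `denoise_two_floors` and the single scalar `alpha` are hypothetical simplifications.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DenoisedPair {
    std::vector<double> ep1;  // floored at beta1 * noise (lightly denoised)
    std::vector<double> ep2;  // floored at beta2 * noise (most denoised)
};

// Spectral subtraction with two different floors, per subband.
// Assumes tau2 = 0: Hp * S = S - alpha * noise.
DenoisedPair denoise_two_floors(const std::vector<double>& s,
                                const std::vector<double>& noise,
                                double alpha, double beta1, double beta2)
{
    assert(s.size() == noise.size());
    assert(beta1 > beta2);
    DenoisedPair out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const double ep = s[i] - alpha * noise[i];
        out.ep1.push_back(std::max(ep, beta1 * noise[i]));
        out.ep2.push_back(std::max(ep, beta2 * noise[i]));
    }
    return out;
}
```

When speech dominates, both versions coincide; in noise-only subbands they settle on their respective floors, which is what makes their comparison informative.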

The voice activity detector 15 of Figure 2
comprises a module 19 which calculates energies of the
denoised versions of the signal $\hat{E}p_{1,n,i}$ and $\hat{E}p_{2,n,i}$
respectively lying in m frequency bands designated by
the index j ($1 \le j \le m$, $m \ge 1$). This resolution may be
the same as that of the subbands defined by the module
12 (index i), or a coarser resolution, possibly extending
to the whole of the useful band $[0, F_e/2]$ of the signal
(case m = 1). By way of example, the module 12 can
define I = 16 uniform subbands of the band $[0, F_e/2]$,
and the module 19 can retain m = 3 wider bands, each
band of index j covering the subbands of index i
ranging from imin(j) to imax(j), with imin(1) = 1,
imin(j+1) = imax(j) + 1 for $1 \le j < m$, and
imax(m) = I. In step 190 (Figure 3), the module 19
calculates the energies per band:

$$E_{1,n,j} = \sum_{i=imin(j)}^{imax(j)} [f(i)-f(i-1)] \cdot \hat{E}p_{1,n,i}^2$$

$$E_{2,n,j} = \sum_{i=imin(j)}^{imax(j)} [f(i)-f(i-1)] \cdot \hat{E}p_{2,n,i}^2$$
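The per-band energy of step 190 can be sketched as a width-weighted sum of squared components. The function name `band_energy` and the `edges` table holding f(0) to f(I) are hypothetical; subband indices are 1-based as in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Energy of one band j covering subbands imin..imax (1-based, inclusive):
//   E = sum over i of [f(i) - f(i-1)] * ep[i]^2
// edges[i] holds f(i) for i = 0..I; ep[i-1] holds the component of subband i.
double band_energy(const std::vector<double>& ep,
                   const std::vector<double>& edges,
                   std::size_t imin, std::size_t imax)
{
    assert(edges.size() == ep.size() + 1);
    assert(imin >= 1 && imin <= imax && imax <= ep.size());
    double energy = 0.0;
    for (std::size_t i = imin; i <= imax; ++i) {
        const double width = edges[i] - edges[i - 1];
        energy += width * ep[i - 1] * ep[i - 1];
    }
    return energy;
}
```

Calling it once per band and per denoised version yields the two energies compared by the detector.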



A module 20 of the voice activity detector 15
performs a temporal smoothing of the energies $E_{1,n,j}$ and
$E_{2,n,j}$ for each of the bands of index j, this
corresponding to steps 200 to 205 of Figure 4. The
smoothing of these two energies is performed by means
of a smoothing window determined by comparing the
energy $E_{2,n,j}$ of the most denoised version with its
previously calculated smoothed energy $\bar{E}_{2,n-1,j}$, or with a
value of the order of this smoothed energy $\bar{E}_{2,n-1,j}$
(tests 200 and 201). This smoothing window can be an
exponential window defined by a forget factor $\lambda$ lying
between 0 and 1. This forget factor $\lambda$ can take three
values: the first, $\lambda_r$, very close to 0 (for example $\lambda_r$ = 0),
chosen in step 202 if $E_{2,n,j} \le \bar{E}_{2,n-1,j}$; the second, $\lambda_q$, very
close to 1 (for example $\lambda_q$ = 0.99999), chosen in step 203
if $E_{2,n,j} > \Delta \cdot \bar{E}_{2,n-1,j}$, $\Delta$ being a coefficient bigger than
1; and the third, $\lambda_p$, lying between 0 and $\lambda_q$ (for example
$\lambda_p$ = 0.98), chosen in step 204 if
$\bar{E}_{2,n-1,j} < E_{2,n,j} \le \Delta \cdot \bar{E}_{2,n-1,j}$. The exponential smoothing with the forget
factor $\lambda$ is then performed conventionally in step 205
according to:

$$\bar{E}_{1,n,j} = \lambda \cdot \bar{E}_{1,n-1,j} + (1-\lambda) \cdot E_{1,n,j}$$

$$\bar{E}_{2,n,j} = \lambda \cdot \bar{E}_{2,n-1,j} + (1-\lambda) \cdot E_{2,n,j}$$
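The selection of the forget factor and the smoothing of both energies can be sketched as follows, in floating point and for a single band. The struct and function names are hypothetical; the default values are the example values quoted in the description, and the comparison is made directly against the previous smoothed energy of the most denoised version.

```cpp
#include <cassert>

struct ForgetFactors {
    double lambda_r = 0.0;      // energy decreases are tracked immediately
    double lambda_p = 0.98;     // slow increases (background noise) are followed
    double lambda_q = 0.99999;  // abrupt increases (speech) are almost ignored
    double delta    = 4.953;    // coefficient bigger than 1
};

struct SmoothedEnergies {
    double e1;  // smoothed energy of the first version
    double e2;  // smoothed energy of the most denoised version
};

// Choose lambda from the energy e2 of the most denoised version only.
double pick_forget_factor(double e2, double e2_smoothed_prev,
                          const ForgetFactors& f = ForgetFactors{})
{
    assert(f.delta > 1.0);
    if (e2 <= e2_smoothed_prev) return f.lambda_r;
    if (e2 > f.delta * e2_smoothed_prev) return f.lambda_q;
    return f.lambda_p;
}

// Apply the same lambda to both smoothed energies.
void smooth_energies(SmoothedEnergies& s, double e1, double e2,
                     const ForgetFactors& f = ForgetFactors{})
{
    const double lambda = pick_forget_factor(e2, s.e2, f);
    s.e1 = lambda * s.e1 + (1.0 - lambda) * e1;
    s.e2 = lambda * s.e2 + (1.0 - lambda) * e2;
}
```

Driving both smoothings from the most denoised energy keeps the two smoothed tracks consistent with each other.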

An exemplary variation over time of the
energies $E_{1,n,j}$ and $E_{2,n,j}$ and of the smoothed energies
$\bar{E}_{1,n,j}$ and $\bar{E}_{2,n,j}$ is represented in Figure 5. It may be
seen that good tracking of the smoothed energies is
achieved when the forget factor is determined on the
basis of the variations in the energy $E_{2,n,j}$
corresponding to the most denoised version of the
signal. The forget factor $\lambda_p$ makes it possible to take
into account the increases in the level of the
background noise, the energy reductions being tracked
by the forget factor $\lambda_r$. The forget factor $\lambda_q$ very close
to 1 means that the smoothed energies do not track the
abrupt energy increases due to speech. However, the
factor $\lambda_q$ remains slightly less than 1 so as to avoid
errors caused by an increase in the background noise
which may arise during a fairly long period of speech.

The voice activity detection automaton is
controlled in particular by a parameter resulting from
a comparison of the energies $E_{1,n,j}$ and $E_{2,n,j}$. This
parameter can in particular be the ratio
$d_{n,j} = E_{1,n,j}/E_{2,n,j}$. It may be seen in Figure 5 that this
ratio $d_{n,j}$ allows proper detection of the speech phases
(represented by hatching).

The control of the detection automaton can also
use other parameters, such as a parameter related to
the signal-to-noise ratio: $snr_{n,j} = E_{1,n,j}/\bar{E}_{1,n,j}$, this
amounting to taking into account a comparison between
the energies $E_{1,n,j}$ and $\bar{E}_{1,n,j}$. The module 21 for
controlling the automata relating to the various bands
of index j calculates the parameters $d_{n,j}$ and $snr_{n,j}$ in
step 210, then determines the state of the automata.
The new state $\delta_{n,j}$ of the automaton relating to band j
depends on the previous state $\delta_{n-1,j}$, on $d_{n,j}$ and on
$snr_{n,j}$, for example as indicated in the diagram of
Figure 6.

Four states are possible: $\delta_j$ = 0 detects
silence, or absence of speech; $\delta_j$ = 2 detects the
presence of voice activity; and the states $\delta_j$ = 1 and
$\delta_j$ = 3 are intermediate states of ascent and descent.
When the automaton is in the silence state ($\delta_{n-1,j}$ = 0),
it remains there if $d_{n,j}$ exceeds a first threshold $\alpha1_j$,
and it switches to the ascent state in the converse
case. In the ascent state ($\delta_{n-1,j}$ = 1), it returns to the
silence state if $d_{n,j}$ exceeds a second threshold $\alpha2_j$,
and it switches to the speech state in the converse
case. When the automaton is in the speech state
($\delta_{n-1,j}$ = 2), it remains there if $snr_{n,j}$ exceeds a third
threshold $\alpha3_j$, and it switches to the descent state in
the converse case. In the descent state ($\delta_{n-1,j}$ = 3), the
automaton returns to the speech state if $snr_{n,j}$ exceeds
a fourth threshold $\alpha4_j$, and it returns to the silence
state in the converse case. The thresholds $\alpha1_j$, $\alpha2_j$,
$\alpha3_j$ and $\alpha4_j$ may be optimized separately for each of the
frequency bands j.
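The four-state automaton described above can be sketched as a pure transition function. This is an illustrative floating-point version working on the ratios d and snr themselves, whereas the appendix works on fixed-point logarithms; the names `VadState`, `Thresholds` and `next_state` are hypothetical, and the default thresholds are the example values quoted for the appendices.

```cpp
enum class VadState { Silence = 0, Ascent = 1, Speech = 2, Descent = 3 };

struct Thresholds {
    double a1 = 1.221;  // silence: stay while d exceeds a1
    double a2 = 1.221;  // ascent: back to silence if d exceeds a2
    double a3 = 1.649;  // speech: stay while snr exceeds a3
    double a4 = 1.221;  // descent: back to speech if snr exceeds a4
};

// New state of the automaton of one band from the previous state,
// the ratio d = E1/E2 and the ratio snr = E1 / smoothed E1.
VadState next_state(VadState prev, double d, double snr,
                    const Thresholds& t = Thresholds{})
{
    switch (prev) {
    case VadState::Silence:
        return d > t.a1 ? VadState::Silence : VadState::Ascent;
    case VadState::Ascent:
        return d > t.a2 ? VadState::Silence : VadState::Speech;
    case VadState::Speech:
        return snr > t.a3 ? VadState::Speech : VadState::Descent;
    case VadState::Descent:
        return snr > t.a4 ? VadState::Speech : VadState::Silence;
    }
    return prev;  // not reached for valid states
}
```

The two intermediate states act as a one-frame confirmation before the automaton commits to speech or to silence.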

It is also possible for the automata relating 
to the various bands to be made to interact by the 
module 21. 

In particular, it may force each of the 
automata relating to each of the subbands to the speech 
state as soon as one among them is in the speech state. 
In this case, the output of the voice activity detector 
15 relates to the whole of the signal band. 

The two appendices to the present description 
show a source code in the C++ language, with a 
fixed-point data representation corresponding to an 
implementation of the exemplary voice activity 
detection method described hereinabove. To embody the 
detector, one possibility is to translate this source 
code into executable code, to record it in a program 
memory associated with an appropriate signal processor, 
and to have it executed by this processor on the input 
signals of the detector. The function
a_priori_signal_power presented in appendix 1
corresponds to the operations incumbent on the modules 
18 and 19 of the voice activity detector 15 of Figure 
2. The function voice_activity_detector presented in 
appendix 2 corresponds to the operations incumbent on 
the modules 20 and 21 of this detector. 

In the particular example of the appendices,
the following parameters have been employed: $\tau1$ = 1;
$\tau2$ = 0; $\beta1_i$ = 0.3; $\beta2_i$ = 0.001; m = 3; $\Delta$ = 4.953;
$\lambda_p$ = 0.98; $\lambda_q$ = 0.99999; $\lambda_r$ = 0; $\alpha1_j = \alpha2_j = \alpha4_j$ = 1.221;
$\alpha3_j$ = 1.649. Table 1 hereinbelow gives the
correspondences between the notation employed in the
above description and in the drawings and that employed
in the appendix.

In the variant embodiment illustrated by Figure
7, the denoising module 25 of the voice activity
detector 15 delivers a single denoised version $\hat{E}p_{n,i}$ of
the speech signal, so that the module 26 calculates its
energy $E_{2,n,j}$ for each band j. The other version, of
which the module 26 calculates the energy, is
represented directly by the non-denoised samples $S_{n,i}$.

As before, various denoising processes may be
applied by the module 25. In the example illustrated by
steps 250 to 256 of Figure 8, the denoising is done by
nonlinear spectral subtraction with a noise
overestimation coefficient dependent on a quantity $\rho$
related to the signal-to-noise ratio. In steps 250 to
252, a preliminary denoising is performed for each
subband of index i according to:

$$S'_{n,i} = \max(S_{n,i} - \alpha \cdot \hat{B}_{n-1,i};\ \beta \cdot \hat{B}_{n-1,i}),$$

the preliminary overestimation coefficient being for
example $\alpha$ = 2, and the fraction $\beta$ possibly
corresponding to a noise attenuation of the order of
10 dB.

The quantity $\rho$ is taken equal to the ratio
$S'_{n,i}/S_{n,i}$ in step 253. The overestimation factor $f(\rho)$
varies in a nonlinear manner with the quantity $\rho$, for
example as represented in Figure 9. For the values of $\rho$
closest to 0 ($\rho < \rho_1$), the signal-to-noise ratio is low,
and it is possible to take an overestimation factor
$f(\rho)$ = 2. For the highest values of $\rho$ ($\rho_2 < \rho < 1$), the
noise is weak and need not be overestimated ($f(\rho)$ = 1).
Between $\rho_1$ and $\rho_2$, $f(\rho)$ decreases from 2 to 1, for
example linearly. The denoising proper, providing the
version $\hat{E}p_{n,i}$, is performed in steps 254 to 256:

$$\hat{E}p_{n,i} = \max(S_{n,i} - f(\rho) \cdot \hat{B}_{n-1,i};\ \beta \cdot \hat{B}_{n-1,i}).$$
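The two-pass denoising of this variant can be sketched per subband as follows. The function names are hypothetical, and so are the breakpoints `rho1` and `rho2` passed by the caller, since the description gives no numerical values for them; the linear ramp between them is the example mentioned in the text.

```cpp
#include <algorithm>
#include <cassert>

// Overestimation factor f(rho): 2 at low rho, 1 at high rho,
// linear in between (rho1 and rho2 are caller-supplied assumptions).
double overestimation_factor(double rho, double rho1, double rho2)
{
    assert(rho1 < rho2);
    if (rho <= rho1) return 2.0;
    if (rho >= rho2) return 1.0;
    return 2.0 - (rho - rho1) / (rho2 - rho1);
}

// Preliminary denoising, estimation of rho, then denoising proper.
// s is the noisy component, noise_prev the noise estimate of frame n-1.
double denoise_component(double s, double noise_prev,
                         double alpha, double beta,
                         double rho1, double rho2)
{
    assert(s > 0.0);
    const double floor_value = beta * noise_prev;
    const double prelim = std::max(s - alpha * noise_prev, floor_value);
    const double rho = prelim / s;
    const double f = overestimation_factor(rho, rho1, rho2);
    return std::max(s - f * noise_prev, floor_value);
}
```

The first pass only serves to estimate how noisy the subband is; the second pass then subtracts more noise where the signal-to-noise ratio is poor.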

The voice activity detector 15 considered with
reference to Figure 7 uses, in each frequency band of
index j (and/or in full band), a detection automaton
having two states, silence or speech. The energies $E_{1,n,j}$
and $E_{2,n,j}$ calculated by the module 26 are respectively
those contained in the components $S_{n,i}$ of the speech
signal and those contained in the denoised components
$\hat{E}p_{n,i}$, calculated over the various bands as indicated in
step 260 of Figure 8. The comparison of the two
different versions of the speech signal pertains to
respective differences between the energies $E_{1,n,j}$ and
$E_{2,n,j}$ and a lower bound of the energy $E_{2,n,j}$ of the
denoised version.
This lower bound $E_{2min,j}$ can in particular
correspond to a minimum value, over a sliding window,
of the energy $E_{2,n,j}$ of the denoised version of the
speech signal in the frequency band considered. In this
case, a module 27 stores in a memory of the first-in
first-out type (FIFO) the L most recent values of the
energy $E_{2,n,j}$ of the denoised signal in each band j, over
a sliding window representing for example of the order
of 20 frames, and delivers the minimum energies
$E_{2min,j} = \min_{0 \le k < L} E_{2,n-k,j}$ over this window (step 270 of
Figure 8). In each band, this minimum energy $E_{2min,j}$
serves as lower bound for the module 28 for controlling
the detection automaton, which uses a measure $M_j$ given
by

$$M_j = \frac{E_{1,n,j} - E_{2min,j}}{E_{2,n,j} - E_{2min,j}}$$

(step 280).

The automaton can be a simple binary automaton
using a threshold $\Delta_j$, possibly dependent on the band
considered: if $M_j > \Delta_j$, the output bit $\delta_{n,j}$ of the
detector represents a silence state of the band j, and
if $M_j < \Delta_j$, it represents a speech state. As a variant,
the module 28 could deliver a nonbinary measure of the
voice activity, represented by a decreasing function of
$M_j$.
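The sliding-window minimum and the binary decision can be sketched as below. The class `SlidingMin` and the functions `measure` and `is_speech` are hypothetical names. The sketch assumes that the measure is the ratio of the two energy differences taken with respect to the lower bound, and it adds a small guard `eps` against a zero denominator, which is not discussed in the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// FIFO of the L most recent energies of the denoised version in one band;
// push() stores the new value and returns the minimum over the window.
class SlidingMin {
public:
    explicit SlidingMin(std::size_t length) : length_(length) { assert(length > 0); }

    double push(double e2)
    {
        window_.push_back(e2);
        if (window_.size() > length_) window_.pop_front();
        return *std::min_element(window_.begin(), window_.end());
    }

private:
    std::size_t length_;
    std::deque<double> window_;
};

// Assumed measure: (E1 - E2min) / (E2 - E2min), with a guarded denominator.
double measure(double e1, double e2, double e2min, double eps = 1e-12)
{
    return (e1 - e2min) / std::max(e2 - e2min, eps);
}

// Binary automaton: speech when the measure falls below the threshold.
bool is_speech(double m, double threshold)
{
    return m < threshold;
}
```

A window of about 20 frames would correspond to `SlidingMin(20)`; a linear scan is enough at that size.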

As a variant, the lower bound $E_{2min,j}$ used in
step 280 could be calculated with the aid of an
exponential window, with a forget factor. It could also
be represented by the energy over band j of the
quantity $\beta \cdot \hat{B}_{n-1,i}$ serving as floor in the denoising by
spectral subtraction.

In the foregoing, the analysis performed in
order to decide on the presence or absence of voice
activity pertains directly to energies of different
versions of the speech signal. Of course, the
comparisons could pertain to a monotonic function of
these energies, for example a logarithm, or to a
quantity having similar behavior to the energies
according to voice activity (for example the power).

APPENDIX 1

/*
 * description
 *
 * NSS module:
 * signal power before VAD
 */

/*
 * included files
 */
#include <assert.h>
#include "private.h"

/*
 * private
 */
Word32 power(Word16 module, Word16 beta, Word16 thd, Word16 val);

/*
 * a_priori_signal_power
 */
void a_priori_signal_power
(
    /* IN    */ Word16 *E, Word16 *internal_state, Word16 *max_noise,
                Word16 *long_term_noise, Word16 *frequential_scale,
    /* INOUT */ Word16 *alpha,
    /* OUT   */ Word32 *P1, Word32 *P2
)
{
    int vad;

    for (vad = 0; vad < param.vad_number; vad++) {
        int start = param.vads[vad].first_subband_for_power;
        int stop = param.vads[vad].last_subband;
        int subband;
        int uniform_subband;

        uniform_subband = 1;
        for (subband = start; subband <= stop; subband++)
            if (param.subband_size[subband] != param.subband_size[start])
                uniform_subband = 0;

        P1[vad] = 0; move32();
        P2[vad] = 0; move32();

        test(); if (sub(internal_state[vad], NOISE) == 0) {
            for (subband = start; subband <= stop; subband++) {
                Word32 pwr;
                Word16 shift;
                Word16 module;
                Word16 alpha_long_term;

                alpha_long_term = shr(max_noise[subband], 2); move16();
                test(); test(); if (sub(alpha_long_term, long_term_noise[subband]) >= 0) {
                    alpha[subband] = 0x7fff; move16();
                    alpha_long_term = long_term_noise[subband]; move16();
                } else if (sub(max_noise[subband], long_term_noise[subband]) < 0) {
                    alpha[subband] = 0x2000; move16();
                    alpha_long_term = shr(long_term_noise[subband], 2); move16();
                } else {
                    alpha[subband] = div_s(alpha_long_term, long_term_noise[subband]); move16();
                }

                module = sub(E[subband], shl(alpha_long_term, 2)); move16();

                if (uniform_subband) {
                    shift = shl(frequential_scale[subband], 1); move16();
                } else {
                    shift = add(param.subband_shift[subband], shl(frequential_scale[subband], 1)); move16();
                }

                pwr = power(module, param.beta_a_priori1, long_term_noise[subband], long_term_noise[subband]);
                pwr = L_shr(pwr, shift);
                P1[vad] = L_add(P1[vad], pwr); move32();

                pwr = power(module, param.beta_a_priori2, long_term_noise[subband], long_term_noise[subband]);
                pwr = L_shr(pwr, shift);
                P2[vad] = L_add(P2[vad], pwr); move32();
            }
        } else {
            for (subband = start; subband <= stop; subband++) {
                Word32 pwr;
                Word16 shift;
                Word16 module;
                Word16 alpha_long_term;

                alpha_long_term = mult(alpha[subband], long_term_noise[subband]); move16();
                module = sub(E[subband], shl(alpha_long_term, 2)); move16();

                if (uniform_subband) {
                    shift = shl(frequential_scale[subband], 1); move16();
                } else {
                    shift = add(param.subband_shift[subband], shl(frequential_scale[subband], 1)); move16();
                }

                pwr = power(module, param.beta_a_priori1, long_term_noise[subband], E[subband]);
                pwr = L_shr(pwr, shift);
                P1[vad] = L_add(P1[vad], pwr); move32();

                pwr = power(module, param.beta_a_priori2, long_term_noise[subband], E[subband]);
                pwr = L_shr(pwr, shift);
                P2[vad] = L_add(P2[vad], pwr); move32();
            }
        }
    }
}

Word32 power(Word16 module, Word16 beta, Word16 thd, Word16 val)
{
    Word32 power;

    test(); if (sub(module, mult(beta, thd)) <= 0) {
        Word16 hi, lo;

        power = L_mult(val, val); move32();
        L_Extract(power, &hi, &lo);
        power = Mpy_32_16(hi, lo, beta); move32();
        L_Extract(power, &hi, &lo);
        power = Mpy_32_16(hi, lo, beta); move32();
    } else {
        power = L_mult(module, module); move32();
    }
    return (power);
}

APPENDIX 2

/*
 * description
 *
 * NSS module:
 * VAD
 */

/*
 * included files
 */
#include <assert.h>
#include "private.h"
#include "simutool.h"

#define DELTA_P     (1.5 * 1024)
#define D_NOISE     (.2 * 1024)
#define D_SIGNAL    (.2 * 1024)
#define SNR_SIGNAL  (.5 * 1024)
#define SNR_NOISE   (.2 * 1024)

/*
 * voice_activity_detector
 */
void voice_activity_detector
(
    /* IN    */ Word32 *P1, Word32 *P2, Word16 frame_counter,
    /* INOUT */ Word32 *P1s, Word32 *P2s, Word16 *internal_state,
    /* OUT   */ Word16 *state
)
{
    int vad;
    int signal;
    int noise;

    signal = 0; move16();
    noise = 1; move16();

    for (vad = 0; vad < param.vad_number; vad++) {
        Word16 snr, d;
        Word16 logP1, logP1s;
        Word16 logP2, logP2s;

        logP2 = logfix(P2[vad]); move16();
        logP2s = logfix(P2s[vad]); move16();

        test(); if (L_sub(P2[vad], P2s[vad]) > 0) {
            Word16 hi1, lo1;
            Word16 hi2, lo2;

            L_Extract(L_sub(P1[vad], P1s[vad]), &hi1, &lo1);
            L_Extract(L_sub(P2[vad], P2s[vad]), &hi2, &lo2);

            test(); if (sub(sub(logP2, logP2s), DELTA_P) < 0) {
                P1s[vad] = L_add(P1s[vad], L_shr(Mpy_32_16(hi1, lo1, 0x6666), 4)); move32();
                P2s[vad] = L_add(P2s[vad], L_shr(Mpy_32_16(hi2, lo2, 0x6666), 4)); move32();
            } else {
                P1s[vad] = L_add(P1s[vad], L_shr(Mpy_32_16(hi1, lo1, 0x68db), 13)); move32();
                P2s[vad] = L_add(P2s[vad], L_shr(Mpy_32_16(hi2, lo2, 0x68db), 13)); move32();
            }
        } else {
            P1s[vad] = P1[vad]; move32();
            P2s[vad] = P2[vad]; move32();
        }

        logP1 = logfix(P1[vad]); move16();
        logP1s = logfix(P1s[vad]); move16();

        d = sub(logP1, logP2); move16();
        snr = sub(logP1, logP1s); move16();

        ProbeFix16("d", &d, 1, 1.);
        ProbeFix16("_snr", &snr, 1, 1.);

        Word16 pp;

        ProbeFix16("p1", &logP1, 1, 1.);
        ProbeFix16("p2", &logP2, 1, 1.);
        ProbeFix16("p1s", &logP1s, 1, 1.);
        ProbeFix16("p2s", &logP2s, 1, 1.);
        pp = logP2 - logP2s;
        ProbeFix16("dp", &pp, 1, 1.);

        test(); if (sub(internal_state[vad], NOISE) == 0)
            goto LABEL_NOISE;
        test(); if (sub(internal_state[vad], ASCENT) == 0)
            goto LABEL_ASCENT;
        test(); if (sub(internal_state[vad], SIGNAL) == 0)
            goto LABEL_SIGNAL;
        test(); if (sub(internal_state[vad], DESCENT) == 0)
            goto LABEL_DESCENT;

    LABEL_NOISE:
        test(); if (sub(d, D_NOISE) < 0) {
            internal_state[vad] = ASCENT; move16();
        }
        goto LABEL_END_VAD;

    LABEL_ASCENT:
        test(); if (sub(d, D_SIGNAL) < 0) {
            internal_state[vad] = SIGNAL; move16();
            signal = 1; move16();
            noise = 0; move16();
        } else {
            internal_state[vad] = NOISE; move16();
        }
        goto LABEL_END_VAD;

    LABEL_SIGNAL:
        test(); if (sub(snr, SNR_SIGNAL) < 0) {
            internal_state[vad] = DESCENT; move16();
        } else {
            signal = 1; move16();
        }
        noise = 0; move16();
        goto LABEL_END_VAD;

    LABEL_DESCENT:
        test(); if (sub(snr, SNR_NOISE) < 0) {
            internal_state[vad] = NOISE; move16();
        } else {
            internal_state[vad] = SIGNAL; move16();
            signal = 1; move16();
            noise = 0; move16();
        }
        goto LABEL_END_VAD;

    LABEL_END_VAD:
        ;
    }

    *state = TRANSITION; move16();
    test(); test(); if (signal != 0) {
        test(); if (sub(frame_counter, param.init_frame_number) >= 0) {
            for (vad = 0; vad < param.vad_number; vad++) {
                internal_state[vad] = SIGNAL; move16();
            }
            *state = SIGNAL; move16();
        }
    } else if (noise != 0) {
        *state = NOISE; move16();
    }
}

1. Method for detecting voice activity in a
digital speech signal (s) in at least one frequency
band, characterized in that the voice activity is
detected on the basis of an analysis comprising a
comparison, in the said frequency band, of two
different versions of the speech signal, one at least
of which is a denoised version obtained by taking
account of estimates of the noise included in the
signal.

2. Method according to claim 1, in which the said
comparison pertains to respective energies
($E_{1,n,j}$, $E_{2,n,j}$), evaluated in the said frequency band, of
the two different versions of the speech signal, or to
a monotonic function of the said energies.

3. Method according to claim 1 or 2, in which the
said analysis furthermore comprises a temporal
smoothing of the energy ($E_{1,n,j}$) of one of the said
versions of the speech signal, and a comparison between
the energy of the said version and the smoothed energy
($\bar{E}_{1,n,j}$).

4. Method according to claim 3, in which the
comparison between the energy of the said version
($E_{1,n,j}$) and the smoothed energy ($\bar{E}_{1,n,j}$) controls the
transitions of a voice activity detection automaton
from a speech state to a silence state, whilst the
comparison of the two different versions of the speech
signal controls the transitions of the detection
automaton from the silence state to the speech state.

5. Method according to any one of claims 1 to 4,
in which the two different versions of the speech
signal are two versions denoised by non-linear spectral
subtraction, a first of the two versions ($\hat{E}p_{1,n,i}$) being
denoised in such a way as not to be less, in the
spectral domain, than a first fraction of a
long-term estimate ($\hat{B}_{n,i}$) representative of a noise component
included in the speech signal, and the second of the
two versions ($\hat{E}p_{2,n,i}$) being denoised in such a way as
not to be less, in the spectral domain, than a second
fraction ($\beta2_i$) of the said long-term estimate, smaller
than the first fraction.

6. Method according to claim 5, in which a
temporal smoothing of the energy of each of the two
versions of the speech signal is performed, by means of
a smoothing window determined by comparing the energy
($E_{2,n,j}$) of the second of the two versions with the
smoothed energy ($\bar{E}_{2,n,j}$) of the second of the two
versions.

7. Method according to claim 6, in which the
smoothing window is an exponential window defined by a
forget factor ($\lambda$).

8. Method according to claim 7, in which the
forget factor ($\lambda$) has a substantially zero value ($\lambda_r$)
when the energy ($E_{2,n,j}$) of the second of the two
versions is less than a value of the order of the
smoothed energy ($\bar{E}_{2,n,j}$) of the second of the two
versions.

9. Method according to claim 8, in which the
forget factor ($\lambda$) has a first value ($\lambda_q$) substantially
equal to 1 when the energy ($E_{2,n,j}$) of the second of the
two versions is greater than the said value of the
order of the smoothed energy multiplied by a
coefficient ($\Delta$) bigger than 1, and a second value ($\lambda_p$)
lying between 0 and the said first value when the
energy of the second of the two versions is greater
than the said value of the order of the smoothed energy
and less than the said value of the order of the
smoothed energy multiplied by the said coefficient.

10. Method according to any one of claims 5 to 9,
in which the first and second fractions ($\beta1_i$, $\beta2_i$)
correspond substantially to attenuations of 10 dB and
60 dB, respectively.

11. Method according to any one of claims 1 to 10,
in which the comparison of the two different versions
of the speech signal pertains to respective differences
between the energies ($E_{1,n,j}$, $E_{2,n,j}$) of these two versions
in the said frequency band and a lower bound ($E_{2min,j}$) of
the energy ($E_{2,n,j}$) of the denoised version of the speech
signal in the said frequency band.

12. Method according to claim 11, in which one of
the two different versions of the speech signal is a
non-denoised version of the speech signal.

13. Device for detecting voice activity in a speech
signal, comprising signal processing means (15)
designed to implement a method according to any one of
claims 1 to 12.

14. Computer program, loadable into a memory
associated with a processor, and comprising portions of
code for implementing a method according to any one of
claims 1 to 12 upon the execution of the said program
by the processor.

15. Computer medium, on which a program according
to claim 14 is recorded.